Welcome to the ultimate arena of decision-making under uncertainty. Imagine you are in a casino, facing a row of slot machines: the classic n-armed bandit problem. This is the fundamental nonassociative setting of reinforcement learning, where we strip away the complexity of changing environments to focus on one burning question: How do we choose the best action when we don't know the rules?
The Interaction Framework
Reinforcement learning is a considerable abstraction of the problem of goal-directed learning from interaction. At each time step $t = 0, 1, 2, \dots$, the agent perceives a state $S_t \in \mathcal{S}$, selects an action $A_t \in \mathcal{A}(S_t)$, and receives a reward $R_{t+1} \in \mathcal{R}$. In the bandit problem there is effectively only one state, so the agent must master its action-selection policy through pure interaction: pull an arm, observe a reward, repeat.
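The interaction loop above can be sketched in a few lines. This is a minimal, illustrative simulation, not a canonical implementation: the hidden means `true_means` and the Gaussian reward noise are assumptions chosen just to make the loop concrete.

```python
import random

# Hidden true values q*(a) of a 3-armed bandit (illustrative numbers).
true_means = [0.2, 0.8, 0.5]

def pull(arm):
    """Pulling an arm returns a noisy reward around its hidden mean."""
    return random.gauss(true_means[arm], 1.0)

# In the nonassociative bandit setting there is no state to perceive:
# the agent simply selects an action and observes a reward, repeatedly.
rewards = [pull(random.randrange(len(true_means))) for t in range(1000)]
```

Note that the agent never sees `true_means`; everything it learns must come from the stream of sampled rewards.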
| Paradigm | Feedback Type | Learning Mechanism |
|---|---|---|
| Supervised Learning | Instructive (The "Right" Answer) | Pattern Matching |
| Bandit Problems | Evaluative (A Score) | Trial-and-Error Search |
The Exploration-Exploitation Dilemma
Because the agent is never told the optimal action, it faces a paralyzing conflict. It must exploit what it already knows to secure immediate rewards, but it must also explore other actions to uncover arms that might yield even higher returns in the future. This tension distinguishes the bandit problem from static optimization and is the heartbeat of adaptive intelligence.
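One standard way to balance this tension is epsilon-greedy action selection: with small probability epsilon the agent explores a random arm, otherwise it exploits its current best estimate. The sketch below pairs this with incremental sample-average value estimates; the bandit itself (`true_means`) and the parameter values are illustrative assumptions, not prescriptions.

```python
import random

true_means = [0.2, 0.8, 0.5]   # hidden q*(a), unknown to the agent
epsilon = 0.1                  # exploration probability
Q = [0.0] * 3                  # estimated value of each arm
N = [0] * 3                    # number of pulls per arm

random.seed(0)
for t in range(5000):
    if random.random() < epsilon:
        a = random.randrange(3)                    # explore: random arm
    else:
        a = max(range(3), key=lambda i: Q[i])      # exploit: greedy arm
    r = random.gauss(true_means[a], 1.0)           # evaluative feedback only
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                      # incremental sample average
```

The update `Q[a] += (r - Q[a]) / N[a]` keeps a running mean without storing past rewards, and even a small epsilon is enough for the estimates to eventually single out the best arm.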